REMARKS

Reconsideration and allowance in view of the foregoing amendment and the following
remarks are respectfully requested.

Rejection of Claims 22-25, 27, 29-32 and 34 Under 35 U.S.C. § 103(a)

The Office Action rejects claims 22-25, 27, 29-32 and 34 under 35 U.S.C. § 103(a) as
being unpatentable over Ezzat et al. (Visual Speech Synthesis by Morphing Visemes) ("Ezzat et
al.") in view of Jiang et al. (Visual Speech Analysis with Application to Mandarin Speech
Training) ("Jiang et al.") in view of Hon et al. (Automatic Generation of Synthesis Units for
Trainable Text-to-Speech Systems) ("Hon et al."). Applicants thank the Examiner for the
detailed discussion in the Advisory Action. Applicants maintain their position that, under an
appropriate obviousness analysis, one of skill in the art would not have sufficient
motivation (by a preponderance of the evidence) to modify the teachings of Ezzat et al. with
the teachings of Hon et al.

Applicants first address some of the particular arguments in the Advisory Action.

First, the Advisory Action states in the discussion of Ezzat et al. and Hon et al. that "given
that both of these references deal with generating audio as well as video streams (as explained in
previously written office actions), the Examiner does not see how one of skill in the art can find
these references to be in such conflict as to suggest teachings away from one another."
(Emphasis added.) Applicants respectfully correct the foundation of the Examiner's conclusion
that he cannot see how one of skill in the art could find conflict in these teachings. This
paragraph asserts that previously written office actions have established that Hon et al. teach
generating audio "as well as video streams". Applicants surmise that the Examiner is
perhaps mistaking the teachings of Hon et al. for those of Jiang et al., which do focus on an
aspect of visual speech analysis and synthesis. However, Applicants strongly submit that Hon et
al. fail to teach anything dealing with video streams and focus exclusively on audio, in audio
synthesis and in the Whistler Text-to-Speech engine. Accordingly, the Examiner cannot rely on this
erroneous interpretation of Hon et al. as a foundation to combine Hon et al. with Ezzat et al.
Applicants suppose that a correct interpretation of the teachings of Hon et al.
certainly opens the way for the Examiner to conceive of why the suggestive power of each
reference can teach away from their combination. Applicants provide further details below on
why this issue is important to whether it is obvious to combine these references,
given that Ezzat et al. utilize visual speech synthesis and Hon et al. fail to do so.

Furthermore, inasmuch as the Advisory Action uses an erroneous characterization of Hon
et al. as a foundation of the entire argument that these references are "similar in their scope",
Applicants respectfully request either a Notice of Allowance based on previous arguments and
arguments set forth herein or a non-final Office Action and a return of our fee for filing this
RCE.

The Advisory Action also characterizes the teachings of Hon et al. incorrectly in several
places. For example, the second sentence of the last paragraph states "However, in the
context of Hon, the criticism of Hon with diphones is referring to is: [sic] the selection or decision
process used to match diphones in existing diphone systems", citing the first paragraph under
Section 2.1. We have discussed this paragraph at length and Applicants note that this paragraph
of Hon et al. does not simply refer to the "selection or decision process" that is used to match
diphones, but discusses the use of the diphone as a synthesis unit for concatenative synthesizers.
Accordingly, the correct interpretation of this paragraph is not that it is referring to a "process
used to match diphones", but rather to the difficulty in using diphones as the synthesis unit in a
TTS system. For example, this paragraph of Hon et al. teaches: "while diphones retain the
transitional information, there can be large distortions due to the difference in spectra between
the stationary parts of two units obtained from different context." Several examples are then
given.

Similarly, several sentences into the last paragraph of the Advisory Action state a similar
characterization of the teachings of Hon:

"Thus, Hon is not criticizing the use of phonemes (which are essential [sic] a unit of
sound) but rather the process to select or match them through diphones (which is pair or
transition of phonemes)."

Again, it appears in this language of the Advisory Action that the Examiner misconstrues
the reference to diphones in Hon et al. Again, the large distortions due to the difference in
spectra between the stationary parts of two units obtained from different context, identified as
the problem with using diphones as the synthesis unit in Hon et al., do not relate to the process
of selecting or matching phonemes in a database. Applicants certainly note that Hon et al. do not
criticize the use of phonemes in general, but rather criticize the use of diphones as the synthesis
unit in a unit selection process. This is the very reason (in connection with the fact that Hon et
al. fail to teach anything regarding visual synthesis) that one of skill in the art would be less
likely to modify the teachings of Ezzat et al. with Hon et al., because Ezzat et al. highlight the
benefits of using diphones as their synthesis unit.

Next, the Advisory Action recites some of the arguments in previous Office Actions to
articulate the enhancements in the technology of Ezzat et al., which the Examiner asserts would
be enhanced by the teachings of Hon et al. In several places, the Examiner references page 4 of
the previous Office Action mailed out iS/26/2007. In this portion of the Final Office Action, the
Examiner asserts "Ezzat does not teach the claimed 'unit selection process' and does not teach
the claimed 'in which a longest possible candidate image sample is selected'". The Office
Action then asserts that Hon et al. teach a unit selection process by teaching of "unit selection"
and suggests the claimed limitation of the longest possible candidate image sample being
selected.

Applicants respectfully traverse this analysis and note that the basic concept of a unit
selection process in text-to-speech is merely the selection of units of speech that are then
concatenated together to produce the synthesized speech. In this regard, we note that Ezzat et al.
teach a text-to-visual speech (TTVS) synthesis system which includes a discussion of
concatenating diphones together. See Section 7, first paragraph. Applicants note that while
Ezzat et al. do not necessarily expressly teach "a unit selection process", one of skill in the art
would already understand the basic process of selecting units (or diphones) and concatenating them
together as is done in the Festival TTS system. For example, it is easy to find on the internet
references to the Festival TTS system that uses unit selection. Accordingly, the concept asserted
in the Office Action that because Ezzat et al. fail to teach "unit selection", one of skill in the art
would have motivation to utilize the unit selection features from Hon et al. is erroneous because
one of skill in the art would already recognize and be familiar with the Festival TTS system
referenced by Ezzat et al., and already understand that such a basic system already uses unit
selection. Accordingly, such a person of skill in the art would not go searching through other
references such as Hon et al. for the purpose of enhancing the teachings of Ezzat et al. with a unit
selection process. The Advisory Action states that "this improvement (the unit selection
approach of Hon et al.)" in the selection process to match phonemes is the very reason why Hon
is combined with Ezzat. Applicants have demonstrated that the reason why the
Examiner asserts one of skill in the art would combine Hon et al. with Ezzat et al. is insufficient
because Ezzat et al. already inherently have that feature, as would be known.

The Advisory Action also states "further, the motivation provided for combining the prior
art is the improvement in the decision or matching process that Hon offers (bottom of page 4 in
Office Action) through their decision tree." On the bottom of page 4 of the final Office Action,
the Examiner states that the advantage "to the combination is that with Hon, unit selection
features selected from a database of a large amount of candidates can produce optimal
concatenation quality" as is mentioned in the first partial paragraph of page 296 of Hon et al.

Applicants again strongly traverse the conclusion that Hon et al. would represent an
enhancement over the technology of Ezzat et al. such that one of skill in the art would be
motivated to combine these references. Applicants strongly note that the Examiner has failed to
analyze Applicants' core argument. This argument, as was referenced briefly above, relates to the
fact that only Ezzat et al. relate to audio as well as video streams in their visual speech synthesis
approach. As Applicants have previously discussed, Ezzat et al. in Section 7 utilize the Festival
TTS system that constructs a final audio stream by concatenating diphones together. (First
paragraph of Section 7.) As is noted in the last paragraph of Section 7, Ezzat et al. highlight that
they "have found the use of TTS timing and phonemic information in this manner produces very
good quality lip synchronization between the audio and video." (Emphasis added.) The TTS
timing and phonetic information is gained through the use of diphones as the selection unit,
which is "this" manner of selection. Also at the end of the last paragraph of Section 7, Ezzat et
al. note that one of the reasons that diphones are advantageous is that "when a viseme transition
is oversampled, the corresponding audio diphone is lengthened to ensure that synchrony between
audio and video is maintained." Again, Applicants' basic point is that the basic unit that is found
to produce "very good" lip synchronization between the audio and the video in Ezzat et al. is the
diphone. Thus, while the Examiner has introduced the concept that there are "deficiencies or
problems in the selection process" of Ezzat et al., Applicants submit that there is no suggestion
within the teachings of Ezzat et al. that using diphones as the unit of selection produces any sort
of deficiency. The suggestion from Ezzat et al. is in fact the opposite, that using diphones as the
selection unit is optimal when synchronizing audio with video.

Furthermore, Applicants again submit that it would certainly not be obvious to
incorporate the teachings of the selection process of Hon et al., which abandons the diphone as the
basic selection unit and requires the use of a decision tree clustered phone-based unit which may be
one of a triphone, quinphone, stress-sensitive phone, word-dependent phone or a combination of
the above. (See first paragraph, Section 2.3.) One reason why one of skill in the art would not be
likely or motivated to incorporate this particular process into the visual speech synthesis process
of Ezzat et al. is that there may be a myriad of challenges that would be introduced into a visual
synthesis approach when it comes to maintaining "very good quality lip synchronization between
the audio and the video." For example, how would one of skill in the art deal with the
situation identified at the end of Section 7 of Ezzat et al. related to
when a visual viseme transition is oversampled? Is it possible to take a corresponding audio
triphone, quinphone, stress-sensitive phone, or word-dependent phone and simply lengthen it to
maintain a synchronous relationship between the audio and video? Such a requirement would
certainly require further research and thought and cannot simply be a matter of replacing
diphones as selection units with a completely different decision tree clustered phone-based unit.
Inasmuch as Hon et al. fail to teach anything regarding visual synthesis, there is nothing in Hon
et al. that would suggest that its approach would provide a benefit to the teachings of Ezzat et
al., and it may more likely introduce synchronization problems into Ezzat et al.

Applicants therefore respectfully submit that, by a preponderance of the evidence,
Applicants have the weightier arguments against there being sufficient motivation or suggestion
to combine these references. Applicants have corrected foundational assertions within the
Advisory Action that cause the conclusions to become much less persuasive inasmuch as they
are based on a technically incorrect characterization of the teachings of either Hon et al. or Ezzat
et al. (i.e., Hon et al. do not deal with audio "as well as video streams" and Ezzat et al. do not
identify any deficiencies or problems in their diphone selection process, but rather highlight it as
producing very good quality lip synchronization).

Accordingly, Applicants respectfully submit that the preponderance of the evidence is
against the combination of these references. Therefore, Applicants request a Notice of
Allowance.

CONCLUSION

Having addressed all rejections and objections, Applicants respectfully submit that the
subject application is in condition for allowance and a Notice to that effect is earnestly solicited.
If necessary, the Commissioner for Patents is authorized to charge or credit the Novak, Druce &
Quigg, LLP, Account No. 14-1437 for any deficiency or overpayment.



Date: September 25, 2007 

Correspondence Address: 
Thomas A. Restaino 
Reg. No. 33,444 
AT&T Corp.
Room 2A-207
One AT&T Way
Bedminster, NJ 07921



Respectfully submitted, 



By: 




Thomas M. Isaacson 

Attorney for Applicants. 
Reg. No. 44,166
Phone: 410-286-9405
Fax No.: 410-510-1433